All group members participated in both assignments and, after discussion, produced this report.
#read the data
olive<-read.csv("olive.csv", row.names=1)
#draw the scatter plot,colored by original linoleic values
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=palmitic,y=oleic,color=linoleic))+
ggtitle("dependence of Palmitic on Oleic")
scatterplot
#discretize linoleic into four classes and plot
linoleic_intervals<-cut_interval(x=olive$linoleic,n=4)
#draw the scatter plot,colored by discretized linoleic values
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=palmitic,y=oleic,color=linoleic_intervals))+
ggtitle("dependence of Palmitic on Oleic")
scatterplot
In the first figure, which uses the continuous linoleic values, the shades are hard to tell apart because human perception distinguishes only a limited number of levels of a single hue. When the variable is discretized into four classes, it is much easier to see which class each point belongs to.
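The discretization step can be illustrated without the olive data. The sketch below uses base R `cut()` with equal-width breaks, which is assumed here to approximate what `cut_interval(x, n = 4)` does; the vector `x` is made-up toy data, not values from olive.csv.

```r
# Toy values standing in for a continuous variable (hypothetical data)
x <- c(450, 620, 700, 810, 905, 1010, 1200, 1462)

# Four equal-width intervals spanning the observed range
breaks <- seq(min(x), max(x), length.out = 5)
x_classes <- cut(x, breaks = breaks, include.lowest = TRUE)

# Each observation now belongs to exactly one of four classes
print(table(x_classes))
```

Mapping `x_classes` instead of `x` to color gives four distinct hues rather than a continuous gradient.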
#change the color based on the figure in question 1
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=palmitic,y=oleic,color=linoleic_intervals))+
ggtitle("dependence of Palmitic on Oleic")+
scale_color_manual(values = c("red", "blue", "green","orange"))
scatterplot
#change the size based on the figure in question 1
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=palmitic,y=oleic,size=linoleic_intervals))+
ggtitle("dependence of Palmitic on Oleic")
scatterplot
## Warning: Using size for a discrete variable is not advised.
#map the discretized linoleic classes to orientation angle
#(one fixed angle per class: 0, 45, 90 and 135 degrees)
angel_olive<-olive%>%mutate(angle=(as.numeric(linoleic_intervals)-1)*pi/4)
scatterplot<-ggplot(data=angel_olive,aes(x=palmitic,y=oleic))+
geom_point()+
geom_spoke(aes(angle=angle),radius=45)+
ggtitle("dependence of Palmitic on Oleic")
scatterplot
The plot that uses color makes the categories easiest to distinguish, followed by the plot that uses orientation angle. The plot that uses size is the hardest to read because many data points overlap. This agrees with the perception metrics: color (hue, 10 levels, 3.1 bits), line orientation (3 bits), size (2.2 bits), which also rank color as the easiest channel to perceive. A capacity of 3 bits corresponds to 2^3 = 8 distinguishable levels.
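The relation between bits and distinguishable levels can be checked with a few lines of R. This is only a sketch of the arithmetic, using the capacities quoted in the text; the variable names are chosen for illustration.

```r
# Channel capacities in bits, as quoted in the text
bits <- c(hue = 3.1, line_orientation = 3, size = 2.2)

# A capacity of b bits corresponds to 2^b distinguishable levels
levels_per_channel <- 2^bits
print(round(levels_per_channel, 1))

# Conversely, the number of bits needed for k levels is log2(k)
print(log2(c(four_classes = 4, eight_levels = 8)))
```

Four classes need 2 bits, which is close to the quoted capacity of size and may be one reason the size plot is the hardest to read.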
#draw the scatter plot,colored by numeric value of region
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=oleic,y=eicosenoic,color=Region))+
ggtitle("dependence of oleic on eicosenoic")
scatterplot
#draw the scatter plot,colored by categorical value of region
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=oleic,y=eicosenoic,color=as.factor(Region)))+
ggtitle("dependence of oleic on eicosenoic")+
labs(color="Region")
scatterplot
In the scatter plot colored by the numeric value of Region, the boundaries between regions are somewhat difficult to identify. Because Region is treated as a continuous variable, the points share one hue and differ only in brightness, so Region should be treated as a categorical variable instead. In the second plot, Region is converted to a factor with three categories, each category gets its own hue, and the boundaries can be identified quickly. This is possible because hue is a preattentive feature.
#draw the scatter plot
#colored by categorical value of linoleic
#shape is defined by a discretized Palmitic (3 classes)
#size is defined by a discretized Palmitoleic (3 classes)
linoleic_3intervals<-cut_interval(x=olive$linoleic,n=3)
palmitic_3intervals<-cut_interval(x=olive$palmitic,n=3)
palmitoleic_3intervals<-cut_interval(x=olive$palmitoleic,n=3)
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=oleic,
y=eicosenoic,
color=linoleic_3intervals,
shape=palmitic_3intervals,
size=palmitoleic_3intervals))+
labs(color="linoleic",shape="palmitic",size="palmitoleic")+
ggtitle("dependence of oleic on eicosenoic")
scatterplot
## Warning: Using size for a discrete variable is not advised.
This figure tries to display too much information at once. It is hard to tell the points apart, especially by size and shape, and many points overlap. The figure illustrates that combining several perception channels does not add up their individual capacities.
scatterplot<-ggplot(data=olive)+
geom_point(aes(x=oleic,
y=eicosenoic,
color=as.factor(Region),
shape=palmitic_3intervals,
size=palmitoleic_3intervals))+
labs(color="Region",shape="palmitic",size="palmitoleic")+
ggtitle("dependence of oleic on eicosenoic")
scatterplot
## Warning: Using size for a discrete variable is not advised.
According to Treisman’s theory, a figure is processed in parallel through individual feature maps, while combining features takes additional time and attention. In this case, the single preattentive feature hue is processed quickly, so the boundaries between regions are easy to see from the colour. Reading colour in conjunction with shape and size, however, requires much more effort.
#create a pie chart using plotly
count_area<-olive%>%group_by(Area)%>% summarize( count=n())
proportion<-count_area$count/sum(count_area$count)*100
pie_chart<-plot_ly(data=count_area,labels=~Area,values=~proportion,textinfo = "none")%>%
add_pie()%>% layout(title = "proportions of oil pie chart",showlegend=FALSE)
pie_chart
Because the labels are hidden and only the hover values are kept, it is difficult to compare the proportions, especially for areas with similar proportions. Reading exact values is also inconvenient: the cursor has to be moved over each slice in turn to see its proportion.
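One way around the repeated hovering is to print the proportions as a sorted table. The sketch below shows the computation on made-up area labels (the vector `area` is hypothetical and does not reproduce the counts in olive.csv).

```r
# Hypothetical area labels (not the real olive.csv counts)
area <- c("North", "North", "South", "South", "South", "Island", "North", "South")

# Percentage of observations per area, largest first
proportion <- sort(prop.table(table(area)) * 100, decreasing = TRUE)
print(round(proportion, 1))
```

A sorted table like this makes similar proportions directly comparable, which the unlabeled pie chart does not.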
# contour plot
contour_plot<-ggplot(olive,aes(x=linoleic, y=eicosenoic))+
geom_density_2d()
contour_plot
#scatter plot
scatter_plot<-ggplot(olive,aes(x=linoleic, y=eicosenoic))+
geom_point()
scatter_plot
The scatter plot of the two variables shows that the observations fall into two groups. These groups are hard to see in the contour plot, where the boundary between them is not clear.
# Read data
q2data <- read_xlsx("baseball-2016.xlsx")
# Check the range of the below variables
print(c(max(q2data$HR),min(q2data$HR)))
## [1] 253 122
print(c(max(q2data$RBI),min(q2data$RBI)))
## [1] 836 575
print(c(max(q2data$OBP),min(q2data$OBP)))
## [1] 0.348 0.299
Here, the ranges of three variables, HR (Home Runs), RBI (Runs Batted In), and OBP (On-Base Percentage), are printed. The ranges differ greatly: HR and RBI are counts in the hundreds, while OBP lies between 0.299 and 0.348, so unscaled distances would be dominated by the variables with large values. Therefore, it is reasonable to scale the data before performing multidimensional scaling (MDS).
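The effect of scaling on the distances can be seen on a tiny example. In the sketch below, the first and last rows reuse the printed minimum and maximum values, while the middle row is invented for illustration; the three rows are hypothetical teams, not rows of the baseball data.

```r
# Toy table with three hypothetical teams; the extremes reuse the printed
# min/max values, the middle row is invented for illustration
toy <- data.frame(HR  = c(122, 190, 253),
                  RBI = c(575, 700, 836),
                  OBP = c(0.299, 0.320, 0.348))

# Unscaled distances are driven almost entirely by RBI and HR
print(dist(toy))

# After scaling, every column has mean 0 and standard deviation 1
toy_scaled <- scale(toy)
print(dist(toy_scaled))
```

After `scale()`, OBP contributes to the distances on the same footing as the count variables.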
#Using code template from course website
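#standardize the numeric columns (3 to 28) to mean 0 and sd 1
q2data.numeric= scale(q2data[,3:28])
#Minkowski distance with the default p=2, i.e. Euclidean distance
d = dist(q2data.numeric, method = "minkowski")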
res=isoMDS(d,k=2)
## initial value 19.856833
## iter 5 value 16.319153
## iter 10 value 16.046215
## final value 15.935476
## converged
coords=res$points
q2MDS=as.data.frame(coords)
q2MDS$League=q2data$League
q2MDS$Team=q2data$Team
plot_22 <- plot_ly( data = q2MDS, x = q2MDS[,1], y = q2MDS[,2],color =q2MDS$League,
colors = c("red","black") ,type = "scatter", mode = "markers", text = q2MDS$Team)
plot_22
By observing the scatter plot, it can be seen that the majority of the
AL teams lie within the range of -2.6 to 3.6 on the x-axis and
above -1.26 on the y-axis, with only 3 NL teams in this
region. The y-axis (2nd MDS component) best differentiates the two
leagues, since both leagues span a similar range on the x-axis. With the
argument text = q2MDS$Team, the team name of each data point can be
checked by hovering over it. The Boston Red Sox seem to be the outlier
in this context, as they are the only AL team outside the aforementioned
range.
#Using code template from course website
sh <- Shepard(d, coords)
delta <-as.numeric(d)
D<- as.numeric(dist(coords))
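n=nrow(coords)
#row index of every pair in the lower triangle (same order as dist())
index=matrix(1:n, nrow=n, ncol=n)
index1=as.numeric(index[lower.tri(index)])
#column index of every pair in the lower triangle
index=matrix(1:n, nrow=n, ncol=n, byrow = TRUE)
index2=as.numeric(index[lower.tri(index)])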
plot_ly()%>%
add_markers(x=~delta, y=~D, hoverinfo = 'text',
text = ~paste('Obj1: ', rownames(q2data)[index1],
'<br> Obj 2: ', rownames(q2data)[index2]))%>%
#if nonmetric MDS is involved
add_lines(x=~sh$x, y=~sh$yf)
#Check stress of MDS
print(res$stress)
## [1] 15.93548
There are two ways to check how well the MDS performed. The
first is to inspect the Shepard plot: if the fitted line is close to the
diagonal and the dots lie close to it, the MDS can be considered
successful. In the above case, the line is not a straight diagonal
and the dots are scattered around it, so the MDS performance
is not ideal. The second is to check the “stress” (a goodness-of-fit
measure calculated from the residuals around the line, reported by
isoMDS in percent), where lower is better. The stress here is
15.9354757, which supports the conclusion that the MDS does not perform ideally.
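The link between the Shepard plot and the stress value can be made explicit by recomputing the stress from the Shepard output. The sketch below uses the built-in `swiss` data purely as a stand-in for the baseball table and assumes the stress reported by `isoMDS` is Kruskal's stress-1 in percent.

```r
library(MASS)

# Built-in example data, used here only as a stand-in for the baseball table
x <- scale(as.matrix(swiss[, -1]))
d <- dist(x)

# Nonmetric MDS in two dimensions (trace = FALSE hides the iteration log)
fit <- isoMDS(d, k = 2, trace = FALSE)

# Shepard() returns the configuration distances (y) and their monotone fit (yf)
sh <- Shepard(d, fit$points)

# Stress-1 in percent: residuals around the fitted line,
# normalized by the size of the configuration distances
stress_manual <- 100 * sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))
print(c(reported = fit$stress, manual = stress_manual))
```

The two printed values should agree up to numerical precision, which shows that the stress summarizes the vertical scatter of the dots around the line in the Shepard plot.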
Hovering over the dots that lie farthest from the line reveals
two pairs of data points, <Obj1: 20 Obj2: 16> and
<Obj1: 17 Obj2: 1>, which represent <Oakland
Athletics, Milwaukee Brewers> and <Minnesota Twins, Arizona
Diamondbacks> respectively. These two pairs are the hardest for the MDS to
map successfully.
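The hover labels depend on index1 and index2 listing the pairs in the same order as `dist()`. The sketch below shows that bookkeeping on four hypothetical objects; the names `labels`, `row_id`, `col_id` and `pairs` are illustrative only.

```r
n <- 4
labels <- c("A", "B", "C", "D")  # hypothetical object names

# Row and column index of every cell, then keep the lower triangle,
# which is the order in which dist() stores the pairs
row_id <- matrix(1:n, nrow = n, ncol = n)
col_id <- matrix(1:n, nrow = n, ncol = n, byrow = TRUE)
index1 <- row_id[lower.tri(row_id)]
index2 <- col_id[lower.tri(col_id)]

pairs <- data.frame(Obj1 = labels[index1], Obj2 = labels[index2])
print(pairs)
```

Replacing `labels` with a vector of team names would show the names directly in the hover text instead of row numbers.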
Q24df <- cbind(q2data,res$points[,2])
Q24_plot_function <- function(i){
ggplot(data = Q24df)+geom_point(aes(x=Q24df[, 29],y=Q24df[,i]))+
xlab("MDS variable 2") +
ylab("") +
ggtitle(colnames(Q24df[i]))
}
plot_list <- lapply(seq(3,28,1),Q24_plot_function)
grid_plot <- grid.arrange(grobs=plot_list,
top=("MDS variable 2 against all other numerical variables"),ncol=4)
The scatter plots of the second MDS variable against the
numerical variables show that the two variables with the strongest connection
are HR.per.game (Home Runs per Game) and HR (Home Runs); both relationships are
positive.
These two variables are related to each other, and both are important
statistics for rating baseball teams. A home run is a
hit that allows the batter to make a complete circuit of the bases and
score a run, so a home run guarantees at least one run and
sometimes more for the team. Hence, a team with higher HR.per.game and
HR can be considered a team with higher scoring potential.
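The visual comparison across 26 panels could be backed up by ranking the variables by their correlation with the second MDS coordinate. The sketch below uses made-up numbers (`mds2`, `strong` and `weak` are hypothetical) and Pearson correlation, which is a choice made here rather than part of the original analysis.

```r
# Hypothetical second MDS coordinate and two made-up variables
mds2 <- c(1, 2, 3, 4, 5, 6)
vars <- data.frame(strong = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0),
                   weak   = c(3, 1, 4, 1, 5, 2))

# Pearson correlation of each variable with the MDS coordinate,
# ordered by absolute strength
r <- sapply(vars, function(v) cor(v, mds2))
ranked <- r[order(abs(r), decreasing = TRUE)]
print(round(ranked, 2))
```

Applied to columns 3 to 28 of the baseball data, the same ranking would indicate whether HR.per.game and HR are indeed the most strongly related variables.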
knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
library(ggplot2)
library(dplyr)
library(plotly)
library(readxl)
library(MASS)
library(gridExtra)